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Abstract 

In this paper we present some extensions to the k-means algorithm for vector quantization 
that permit its efficient use in image segmentation and pattern classification tasks. It 
is shown that by introducing state variables that correspond to certain statistics of the 
dynamic behavior of the algorithm, it is possible to find the representative centers of 
the lower dimensional manifolds that define the boundaries between classes, for clouds of 
multi-dimensional, multi-class data; this permits one, for example, to find class boundaries 
directly from sparse data (e.g., in image segmentation tasks) or to efficiently place centers 
for pattern classification (e.g., with local Gaussian classifiers). The same state variables 
can be used to define algorithms for determining adaptively the optimal number of centers 
for clouds of data with space-varying density. Some examples of the application of these 
extensions are also given. 
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This scheme suffers from some limitations: in the first 
place, it is difficult to analyze (except in some particular 
cases [25]), and thus to understand its performance in a 
precise way; besides, the neighborhood structure is im¬ 
posed rather than found from the data (although some 
modifications have been proposed to this end [21]), which 
limits its usefulness in unsupervised clustering tasks. 

The above considerations provide the motivation for 
the present work: it is our purpose to extend the LKMA 
so that some of its limitations are overcome; specifically, 
we will propose extended versions of the algorithm that: 

i) Allow for a rigorous analysis of its convergence 
properties. 

ii) Work well for clustered data. 

iii) Will, if desired, find the centers of the inter-class 
boundary set. 

iv) Adapt the number of centers to the local spatial 
density of the data. 

v) Find the “natural” neighborhood structure for the 
centers of a data set (i.e., may be used for unsu¬ 
pervised clustering). 

The plan of the presentation is as follows: in sec¬ 
tion 2, we introduce a family of algorithms that include 
the LKMA as a limiting case, and give a general con¬ 
vergence theorem for this class; also in this section, we 
present some basic extensions and generalizations needed 
for finding the centers of the inter-class boundary set, 
and for adapting the number of centers to the data den¬ 
sity. In section 3 we give examples of the application 
of the extended scheme, specifically, to image segmenta¬ 
tion and pattern classification, and finally, in section 4, 
we present some conclusions and open problems. 

2 Extended Local K-Means Algorithms 

First, we will introduce a generalization of the LKMA 
that will help us to understand its convergence proper¬ 
ties. Consider again a set X — {aq,..., x^} of points 
in IT 0 , a set of centers {mi,..., tum }, and the weighted 
error measure: 

1 N M 

£p( m ) = 9 ~ m kf w ik ( 6 ) 

i = l k = 1 

where 

w p = ex p[-/%«- m *ii 2 ] m 

ik E^expMUx.-m.-IH 

It is clear that the error measure (2) may be obtained 
as the limit of (6) as f3 —» oo. Now, £p is differentiable, 
and its gradient with respect to is given by: 

N 

VkSp = J2(m k - Xi)Wf k ( 8 ) 

2 — 1 

where 
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A standard gradient descent procedure for minimizing 
£p would therefore take the form: 

= m* } “ a t J2( m{ k } ~ x i) w ik ■ (9) 

2 — 1 

for some sequence {a*} converging to zero. However, it 
is often more convenient to adopt a stochastic gradient 
descent algorithm for minimizing £p, that is equivalent 
to approximating, at each step, the sum on the right side 
of eq. (9) with just one term, randomly drawn among 
the N terms. In formula: 

m i‘ +1) = m i t) - - *« t) w lk 0°) 

where {£*} is a sequence of random variables which take 
values on {1,2,..., N}. This minimization technique is 
especially convenient when the data points come one 
at a time, and it has been extensively used by the neural 
network community, as a part of the so called “back- 
propagation” procedure for neural networks training. In 
the limit as —► oo, (10) becomes precisely the LKMA 
(3). The convergence of (10) to a local minimum of £$ 
follows from the following lemma (the proof is given in 
the appendix; see also [28] for closely related results): 

Lemma 2.1 Let F(y) : R^ i—► R be of the form: 

1 N 

nv) = My) + j^2fi(y) 

2 — 1 

where fi are differentiable functions whose gradient is 
bounded and satisfies the following Lipschiz condition: 

IIWt(y) - V/,V)|| < M\\y — 2/11 i = 0,..., N 
for some positive number M. Let {y n } be the sequence 

Un +1 Vn ftiiLJri) ^^nf]n 00 

where {£ n } is a sequence of random variables that take 
values in {1 ,...,jV} with uniform probability distribu¬ 
tion, {a n } is a sequence of scalars satisfying: 

oo oo 

I]a n =oo and £ a 2 < °° 

2=1 2—1 

and {r) n } is a sequence of random variables that is 
bounded and converges to zero with probability one. As¬ 
sume that {y n } is bounded and that S is a locally asymp¬ 
totically stable point of the ordinary differential equation: 

y = _VF (12) 

with domain of attraction As . Then, if y n £ G for all 
n , for some compact domain G C As, {y n } converges to 
S with probability one. 

To satisfy the conditions of the lemma, the sequence 
{a*} in (10) must be chosen appropriately. One possible 
choice is, for example, 
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a t = l/t . 



else if hk > 0 S N, generate a new center at a loca¬ 
tion corresponding to one the data points inside the 
current Voronoi polytope of center k. 

where 0;,0 5 E [0,1] are two suitably chosen thresholds 
such that ( 0 S — 6{)N > 1. Note that this choice for 
the location of the new centers is necessary to ensure 
convergence (see below); in practice, however, one may 
simply locate them at m*. + er where r is a random unit 
vector, and e a positive number small enough, so that 
the new center is attracted by at least one data point 
inside S k . 

If the total number of centers change after a sweep, 
one should reset h k to zero for all Ar, and a t to a t _yv, and 
effect a new sweep until the number of centers stabilize. 

It is not difficult to show the convergence of (18) when 
the lower threshold 0i is set to 0: in this case, one starts 
with one processor (center) and successively generate 
new ones until the Voronoi polytopes corresponding to 
all centers contain less than 0 S N data points. Since in 
this case the number of centers increases monotonically 
and it is bounded above, (e.g., by the total number of 
data points), it will necessarily converge to a fixed num¬ 
ber M*. 

0 S is a free parameter that controls the expected av¬ 
erage number of points per center, while 0 l controls the 
variance of this number (a small variance is obtained if 
(0 S — 0{) is small). The fact that one has control over 
the variance means that one can generate more uniform 
center distributions with this method than with the stan¬ 
dard k-means scheme. This, in turn, will usually improve 
significantly the performance of other procedures that 
may use these centers, for example, for vector quantiza¬ 
tion or for function approximation (see section 3). 

In practice, it is convenient to set the lower thresh¬ 
old to a positive value to prevent centers to be attracted 
to single outlier data points, as well as the existence of 
centers with empty Voronoi poly topes (which may hap¬ 
pen due to random initialization, if one starts with more 
than one center). Note, however that convergence can¬ 
not be guaranteed in this case; consider the following 
example: suppose we have N = 7 data points; N0 S = 5 
and NOi = 4. It is clear that the number of centers 
generated by (18) will always oscillate between 1 and 2. 

This kind of pathological situations are unlikely to 
occur in practice though, specially if (6 S — 0i) is large 
enough (say, if 0 S > 30;). A practical way of ensuring 
convergence in any case is to let 0; go to zero after a 
fixed number of iterations. 

2.2.2 Boundary Finders 

In the case of multi-class data, the augmented state 
may be used to find the inter-class boundaries directly 
from sparse data. 

Let us assume that we have class-specific centers for 
classes 1 through M — 1. Consider the 2-class case first: 
the augmented state will contain, for each center k , the 
vector (rrik, hk,h\k) (we only have one type of centers 
in this case). The idea is to constrain the update rule 
so that a center position is updated approximately the 
same number of times by data points belonging to each 


class, so that at all t , 
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The boundary-finding update rule may thus take the 
form: 


m i <+1) = m^ + atixi -m^), (19) 

if xt £ S^ and < ifcjj. 0 

— , otherwise 

where x; is the data point chosen at time t. 

It is clear that with this rule we will have, at any time 

C | h y k — 2 h\f. |< 1, so that there will be approximately 
the same number of data points belonging to each class 
inside the Voronoi poly tope of each center, provided that 
the data density is uniform. In this case, upon conver¬ 
gence, every center k will be located at about the mid¬ 
point of the centroids of the sets {C 0 nSfc} and {C\ HS*} 
where C n is the set of data points of class n. 

To see why this is true, note that when the update 
rule (20) reaches its steady state, we must have that 

2 

E[(xi - m k )} - ^ X] ~ ™ k )P k (i, c) = 0 

c—\ x t £S k 

where Pjt(i, c) is the probability of selecting an example 
i that is in Sk and belongs to class c and E[-] denotes 
the expected value. Now, 


Pk{h c ) — Pr(select i | i E C c 0 Si t) Pr(select C c ) « 

1 1 

rv ~ ‘ r "" * ~ 

I c c n s k | 2 

where | C c H Sk | denotes the number of points of class 
c inside S*, so that 
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It is in this sense that one may say that the centers 
are representative samples of the inter-class boundary 
set. 

This procedure may be generalized to Q > 2 classes, 
by sampling, for each class {1,. .. , Q - 1}, the boundary 
between itself and all the other classes. This however, is 
not very efficient, since many parts of the boundary will 
be sampled several times. A more economical sampling 
may be obtained by defining the sets: 



T\ — &k +1 U £k +2 u . . . Cq 

for k ~ 1,... Q — 1, and finding, for each k, the centers 
{mfci,..., m kMk } that sample the boundary between the 
sets 7^ and T* using algorithm (20). 

It is of course possible (and desirable) to combine this 
procedure with the one for finding the number of centers 
in an adaptive way. Figure 3 shows the performance of 



_Figure 4 around here_ 

A more efficient way of achieving the same result is to 
define a “pyramid” of processes that operate from coarse 
to fine scales, and that increase the number of centers at 
each refining step: one may start with a 3 x 3 lattice , 
which after a few iterations of (22) may be refined by 
adding intermediate centers whose initial positions cor¬ 
respond to the midpoints of existing ones (see figure 5), 
and repeat the whole procedure recursively until the de¬ 
sired number of centers is obtained. Note that with this 
procedure, the potentials are always of the form (23) 
with k = 1. 


-Figure 5 around here_ 

The final configurations obtained in this way (see fig¬ 
ure 4-b) are very similar to those obtained with Koho- 
nen’s algorithm [15] (which also incorporates long range 
interactions), but since the neighborhood size remains 
fixed, the computational complexity is lower, and since 
the new centers are already close to their correct (glob¬ 
ally ordered) positions, the convergence rate is signifi¬ 
cantly faster. 

It is convenient, as in the case of Kohonen’s scheme, 
to mantain a fixed, relatively large a t in (22) until the 
final number of centers is reached. At this point, the use 
of an appropriately decreasing sequence guarantees the 
final convergence to a local minimum of (21) (see lemma 
1). Other examples of the use of this approach will be 
given in the next section. 

3 Applications 

In this section we present some applications that illus¬ 
trate the power of the techniques we have developed 

3.1 Edge Detection from Sparse Data 

Suppose we have a set of sparse data points X = 
{xi inside the unit square Q in R 2 , which may 

belong to either one of two classes {0,1}, and suppose 
there is a closed region A C D whose boundary is a 
closed, smooth curve, and such that 

C(x) = 1 x E A 

(i.e., all the data points in class 1 are inside A). The 
problem now is to find a polygonal line (i.e., a sequence 
of points {mi,... rajvf}) that lies close to the smooth 
curve that defines the boundary of A. 

This may be achieved by combining the adaptive 
boundary-finding scheme of section 2.2.2 with a prior 
MRF constraint on the configuration of centers that cor¬ 
responds to a circular lattice (i.e., a closed polygonal 
line). In particular, to every clique of 3 neighboring sites 
(i,j, k) we associate the potential: 

Vijk(m) = i|| - m, + ‘2mj - m k || 2 

(note that m\ and ttim are considered neighbors in a 
circular lattice). 

The combined update rule takes the form: 
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if x t G sf> and ftj 0 < ± k 
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, otherwise 


(24) 


where Xi is the data point chosen at time t. 

An example of the performance of this algorithm is 
shown in figure 6. Note that since in the final config¬ 
uration the centers are ordered, this scheme is in fact 
finding the (discrete) boundary curve from sparse data 
without interpolating the corresponding surface. In this 
sense, it may be said that this algorithm finds the initial 
position, the number of knots and the final configura¬ 
tion of a “snake” [12] that approximates the inter-class 
boundary. 


Figure 6 around here 


-Figure 7 around here_ 

With straightforward modifications, this scheme may 
be used for finding: multiple closed boundaries (fig. 7-a 
and 7—c); open curves that go from one border of the 
image to another (fig. 7-b and 7-d), etc. 

3.2 Local Gaussian Classifiers 

Gaussian Classifiers [5] are a well known class of pro¬ 
cedures for the segmentation of multi-class, multi¬ 
dimensional data. The classification procedure for 
each specimen x E R n involves the computation of Q 
quadratic discriminant functions (one for each class) : 


D k = (ar-/i t ) T Efc l (*-/it) + log|St | 

for k — 1..., Q (25) 

where //*, is the estimated location of the centroid of 

class k; Hk — [ a ijh ls the estimated (n x n) covariance 

matrix and | E* | is its determinant. The specimen is 
then assigned to the class with the lowest value of Dk . 

The “learning” phase consists in the computation of (jl 
and £ from a set of examples {a?i,..., xjv} with known 
classes {ci, ..., c/v}: 


_ E"] x r 6{C{x r ) - k) 
k E ? =i m*r)-k) 

(k) _ ^2r=] %ri%rj <5(C (ff r ) ~ k) 

2^r=l °{C(x r ) - k) 

so that it takes only one pass through the data. 


(26) 

(27) 
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and Poggio in [4]. 

In both cases the training and test error were less than 
5% with 3 centers (3 Gaussians per class), and less than 
8% with one Gaussian per class. 

_Figure 8 around here_ 


3.3 Image Segmentation 

As a final example, we consider an image segmentation 
problem that arises in the processing of certain biomed¬ 
ical images: scintigraphic images [11, 19], which are ob¬ 
tained by counting the number of radioactive particles 
that incide on each cell of a receptor array. The goal of 
the processing step is to obtain from these measurements 
an estimate of the radioisotope distribution in specific 
organs within the human body. 

Particle count and radioisotope concentration are re¬ 
lated by the Poisson distribution formula; therefore, the 
processing step consists in the restoration of a piecewise 
smooth function corrupted by Poisson noise. If it were 
possible to find the boundaries of the organ in question 
(e.g., the heart), the problem would reduce to filtering a 
smooth function within a given domain, for which effec¬ 
tive methods are available (for example, Bayesian esti¬ 
mation methods with MRF priors and quadratic poten¬ 
tials to model the smoothness constraint [20]). In the 
example that we give here, we show that it is possible 
to adapt the methods that we have presented to classify 
the pixels of a scintigraphic image of the heart in such 
a way that one class corresponds approximately to the 
interior of the organ. 

To do this, we will use the following concepts: 

We assume that the two classes are characterized 
solely by the intensity level of the image, i.e., the in¬ 
terior (class 1) has high intensity with respect to the 
background (class 2). It is assumed that the classes are 
fuzzy sets [29] with membership functions of the form: 


0i (z) — - 

1 + exp [—/3(z - 0)] 

<t>2(z) = 1 - <M*) 

where (3 and 0 are positive parameters. 

The formulae for the parameters of the discriminant 
functions of the local Gaussian classifier c are modified 
in the obvious way: 



c _ Y2r x r ( t ) k j z ( x r)) 

ZrMz(Xr)) 

_(c,fc) J2r x r l x r J <f>k(z(x r )) c c 

= ^m^T) —^ 


(29) 

(30) 


where the sums are taken over the learning domain of 
the classifier; x r denotes the coordinates of pixel r of 
the image, and z(x r ) denotes the value of the observed 
intensity. 

The learning domain of each local classifier is taken 
as before, as the Voronoi polygon of a center that sam¬ 
ples the image in an appropriate way. In this particular 
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example, we are interested in segmenting the left ven- 
tricule of the hearth taken from a left anterior oblique 
projection. From this viewpoint, the ventricule appears 
as a high intensity “donut” over a dark background (see 
figure 9-a). Therefore, it is desirable that the centers 
are located uniformly along a closed, smooth curve that 
is attracted towards the higher intensity region of the 
image (note that the quadratic decision surface of each 
local classifier may be an hyperbola, and therefore, it 
can adequately segment a region that looks like a band 
within its domain). 

Since a scintigraphic image is actually representing 
particle counts, we may use update rule (24) directly, 
considering that at each location x r there are z(x r ) data 
points. To get an appropriate behavior for this update 
rule, however, it is necessary that all the data points are 
visited in a random order (see lemma 1). To obtain this 
condition, it is not enough to visit the sites of the lattice 
randomly; it is also necessary, when, each site i is vis¬ 
ited, to 4C fIip a coin”, and only update the corresponding 
center location with probability P up date = z{xi)/z max , 
where z max = max* z(xi). 

_ Figure 9 around here_ 

The results of this procedure applied to the real scinti¬ 
graphic image of figure 9-a are shown in figure 9-b. 
The white squares indicate the center locations, and the 
white line the final compound decision boundary. The 
threshold 0 was obtained as the minimum between the 
two largest peaks of the global histogram of the image; in 
the experiments we performed, however, we found that 
the final results are not very sensitive to the precise value 
of neither this nor the other parameter (/?). 

4 Conclusions 

In this paper we have analyzed the local K-Means algo¬ 
rithm, and have presented some extensions that increase 
its range of applicability. Our main contributions are the 
following: 

i) We have established sufficient conditions for the 
convergence of this algorithm to a (local) minimum 
of a quadratic distorsion measure (lemma 1). In 
doing so, we showed that it can be obtained as the 
limit (as the parameter (3 becomes large) of a fam¬ 
ily of algorithms which are closely related to those 
obtained by minimizing an information distorsion 
measure. For moderate values of /?, we showed 
that these algorithms can be used for unsupervised 
learning (clustering) tasks, since they can find a 
neighborhood structure for the centers that reflects 
the structure of the data. 

ii) We showed that by varying the parameter (3 and 
adding a noise term to the update equation, it is 
possible to improve significantly the performance 
of the algorithm for clustered data. 

iii) We introduced a modification to this algorithm 
that consists in augmenting the state of each pro¬ 
cessor (center) so that it keeps track of its own 
dynamic behavior. With this modification, it is 



Theorem A.2 Let {z n } be a sequence of independent 
random variables with zero mean. Then, if 

oo 

E < 00 

n~ 1 

the series z n converges with probability 1. 

In order to apply this theorem to the random variable 
Si , we first need to check that Si has zero mean. The 
probability distribution of the random variables £ has 
been assumed to be uniform, and therefore P(£i) = 

We then have: 


E[ Si ] = ai E[VF- V/fc —V/ 0 ] = 

= a ( [VF - E[Vf u ) - Vf 0 ] = 

1 N 

= ~ ^ E - v/ 0 ] = o. 

u = 1 

Similarly we have 

£[*, 2 ] = a^[(VF - V/ ft - V/o) 2 ] = OjC(y) 

for some function C(y). If we assume that 

00 

E a « < 00 

n = l 

then E “=1 E[ s %\ = c (x) En°=i a n < +°°- Therefore 
the series ]e™ = 1 converges with probability 1, assump¬ 
tion A4 holds, and lemma 2.1 is proved. 

Note that the assumptions about the boundedness of 
the sequence {y n }> and the boundedness of the gradi¬ 
ent of the fi are not as restrictive as they seem, and in 
practice may be dropped. In fact, we can restrict the 
sequence {y n } to a compact region G C R r , considering, 
instead of eq. (31), the following sequence: 

Vn -\-1 — I ^G[yn ~ a n(^ foiVn) T ^/f n (j/n)) T a iPhi\ (33) 

where IIg is the projection onto G. In this case, lemma 
2.1 now holds if we substitute VF in eq. (12) with 
IIgVF. Choosing G sufficiently large to contain the 
fixed points of interest, the conclusion of the theorem 
is practically unchanged. Moreover, since now the se¬ 
quence is restricted to the compact set G, the derivatives 
of the fi are certainly bounded on G , being continuous, 
and therefore the assumpions of boundedness of the gra¬ 
dient of the fi can be dropped. 
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